---
title: "Car Price Project"
output:
flexdashboard::flex_dashboard:
vertical_layout: scroll
source_code: embed
---
```{r setup, include=FALSE, warning=FALSE}
#include=FALSE will not include r code in output
#warning=FALSE will remove any warnings from output
library(flexdashboard)
library(tidyverse)
library(GGally)
library(caret) #for logistic regression
library(broom) #for tidy() function
```
```{r load_data}
df <- read_csv("C:/Users/Luca.Buselli/Desktop/CarPriceData.csv")
```
Introduction {data-orientation=rows}
=======================================================================
Row {data-height=250}
-----------------------------------------------------------------------
### Overview
For this project, we will follow the DCOVAC process. The process is listed below:
DCOVAC – THE DATA MODELING FRAMEWORK
* DEFINE the Problem
* COLLECT the Data from Appropriate Sources
* ORGANIZE the Data Collected
* VISUALIZE the Data by Developing Charts
* ANALYZE the data with Appropriate Statistical Methods
* COMMUNICATE your Results
Row {data-height=1050}
-----------------------------------------------------------------------
### The Problem & Data Collection
#### The Problem
For this project, I am analyzing data about cars and their prices. I am presented with many different variables for each car instance, and my aim is to use these variables to predict the price of the car and to determine whether there are significant correlations or variables to which price is particularly susceptible.
#### The Data
This data set has 86 rows and 35 columns/variables. For this analysis, we will ignore several of the columns/variables found in the raw data CSV. These will be dropped later in the code below (after first loading all the raw data). These variables include:
`car_ID` variable, which is only an ID for each row of data...
`CarBrand` variable which was used to create the CarHonda, CarNissan, CarSubaru, CarToyota, and CarVolvo column predictors (ones and zeros)...
`aspiration` variable/column used to create the aspirationDV column (ones and zeros)...
`doornumber` is largely captured by the variables describing a car's dimensions, and it also lacked variability...
`carbody` column used to create cb_hatchback, cb_sedan, and cb_wagon (ones and zeros)...
`drivewheel` lacked variability...
`enginetype` lacked variability...
`cylindernumber` lacked variability...
`fuelsystem` lacked variability...
`Validation.Column`, since we are not validating anything in these R models...
#### Data Sources
I used this data set, which I found on Kaggle, for my Automating Business Processes class.
### The Data
VARIABLES TO PREDICT WITH
* *CarHonda*: if car is Honda then 1, else 0.
* *CarNissan*: if car is Nissan then 1, else 0.
* *CarSubaru*: if car is Subaru then 1, else 0.
* *CarToyota*: if car is Toyota then 1, else 0.
* *CarVolvo*: if car is Volvo then 1, else 0.
* *aspirationDV*: Type of engine, whether it is turbo (1) or standard (0).
* *cb_hatchback*: if car is hatchback then 1, else 0.
* *cb_sedan*: if car is sedan then 1, else 0.
* *cb_wagon*: if car is wagon then 1, else 0.
* *wheelbase*: the distance between the center of the front and rear wheels.
* *carlength*: The length of the car.
* *carwidth*: The width of the car.
* *carheight*: The height of the car.
* *curbweight*: The weight of car including a full tank of fuel and standard equipment.
* *enginesize*: Size of engine.
* *boreratio*: The bore-stroke ratio is the ratio of bore to stroke. A ratio of 1:1 is referred to informally as square. Your engine's bore-stroke ratio plays a vital part in the way it produces power. The bore-stroke ratio and the cylinder volume affect the exhaust emissions and the fuel consumption of an engine.
* *stroke*: A stroke is when the piston moves from the top center to the bottom of the cylinder.
* *compressionratio*: Compression ratios usually range from 8:1 to 10:1. A higher compression ratio -- say, from 12:1 to 14:1 -- means higher combustion efficiency.
* *horsepower*: Horsepower refers to the power an engine produces. One horsepower is the power needed to move 550 pounds one foot in one second, or equivalently 33,000 pounds one foot in one minute.
* *peakrpm*: RPM stands for revolutions per minute, and it's used as a measure of how fast any machine is operating at a given time.
* *citympg and highwaympg*: MPG stands for miles per gallon and it refers to a common measurement of fuel economy - the average number of miles a vehicle can drive on one gallon of fuel. The higher the MPG, the more economically friendly a car is.
VARIABLES WE WANT TO PREDICT
* *price*: price of each car row instance
* *PriceDV*: categorical version of price: 1 if the car's price is above the average car price, else 0
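As an illustration of how *PriceDV* relates to *price* (the CSV already contains the column, so this is only a sketch, not part of the cleaning code), the dummy can be rebuilt in one line; the sample values below are the five-number points from the price summary:

```r
# Illustration only: deriving a PriceDV-style dummy from price
# (values are the min, quartiles, and max from the price summary)
price <- c(5118, 7378, 9022, 12944, 22625)
PriceDV <- as.integer(price > mean(price))
PriceDV  # 0 0 0 1 1 -- the two cars above the sample mean get a 1
```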
Data
=======================================================================
Column {data-width=650}
-----------------------------------------------------------------------
### Organize the Data
Summary Statistics of variables after data cleaning.
```{r, cache=TRUE}
#the cache=TRUE can be removed. This will allow you to rerun your code without it having to run EVERYTHING from scratch every time. If the output seems to not reflect new updates, you can choose Knit, Clear Knitr cache to fix.
#Clean data by replacing spaces with decimals
colnames(df) <- make.names(colnames(df))
#remove car_ID due to it being a row index, not a real continuous number
df <- select(df,-car_ID)
df <- select(df,-fueltype)
df <- select(df,-aspiration)
df <- select(df,-doornumber)
df <- select(df,-carbody)
df <- select(df,-drivewheel)
df <- select(df,-enginetype)
df <- select(df,-cylindernumber)
df <- select(df,-fuelsystem)
df <- select(df,-Validation.Column)
df <- select(df,-CarBrand)
#View data
summary(df)
```
From this summary we can see that our variables take a variety of different values depending on their types. Refer to the raw data table and "The Data" section of this report for why we removed certain variables.
Column {data-width=350}
-----------------------------------------------------------------------
### Transform Variables
The PriceDV column is a categorical variable that is 1 if price > average price, else 0. We will convert this categorical variable to a factor.
```{r, cache=TRUE}
df <- mutate(df,PriceDV=as.factor(PriceDV))
```
#### PriceDV Distribution
```{r, cache=TRUE}
as_tibble(select(df,PriceDV) %>%
table())
```
#### PriceDV (Above or Below average car price)
<!--Instructions to import .jpg or .png images
use getwd() to see current path structure
copy file into same place as .Rmd file
put the path to this file in the link
format:  -->

Data Viz #1
=======================================================================
Column {data-width=500}
-----------------------------------------------------------------------
### Response Variables
#### PriceDV Above (1) / Below (0)
```{r, cache=TRUE}
as_tibble(select(df,PriceDV) %>%
table()) %>%
ggplot(aes(y=n,x=PriceDV)) + geom_bar(stat="identity")
```
We can see that about 65% of the cars are below the average car price (< $10,597.90). Looking at the potential predictors related to PriceDV, the strongest relationships appear among the car-dimension variables.
Column {data-width=500}
-----------------------------------------------------------------------
### Transform Variables
```{r, cache=TRUE}
ggpairs(select(df, price, PriceDV, wheelbase, carlength, carwidth, carheight, curbweight))
```
Data Viz #2
=======================================================================
Column {data-width=500}
-----------------------------------------------------------------------
### Response Variables
#### price
```{r, cache=TRUE}
ggplot(df, aes(price)) + geom_histogram(bins=20)
```
We can clearly see the data is skewed to the right. The majority of cars sit around the 6,000 to 7,000 price point. This, together with most cars being below the average price, shows that the data contains a small number of cars with very high prices. This might indicate that brand is a very significant factor, or that a few cars have very good performance (internal aspects such as stroke, boreratio, enginesize, etc.) and therefore cost more.
Column {data-width=500}
-----------------------------------------------------------------------
### Transform Variables
```{r, cache=TRUE}
ggpairs(select(df,price,PriceDV,enginesize,boreratio,stroke,compressionratio,horsepower))
```
price Analysis {data-orientation=rows}
=======================================================================
Row
-----------------------------------------------------------------------
### Predict Car Price
For this analysis we will use a Linear Regression Model.
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
price_lm <- lm(price ~ . -PriceDV,data = df)
summary(price_lm)
```
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(price_lm)
```
### Adjusted R-Squared
```{r, cache=TRUE}
ARSq<-round(summary(price_lm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-up")
```
### RMSE
```{r, cache=TRUE}
Sig<-round(summary(price_lm)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-up")
```
Row
-----------------------------------------------------------------------
### Regression Output
```{r,include=FALSE, cache=TRUE}
#knitr::kable(summary(price_lm)$coef, digits = 3) #pretty table output
summary(price_lm)$coef
```
```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(price_lm))[,4])
out <- coef(summary(price_lm))[idx,]
knitr::kable(out, digits = 3) #pretty table output
```
### Residual Assumptions Explorations
```{r, cache=TRUE}
plot(price_lm, which=c(1,2)) #which tells which plots to show (1-6 different plots)
```
Row
-----------------------------------------------------------------------
### Analysis Summary
After examining this model, we determine that some predictors are not important in predicting car price, so we create a pruned version of the model by removing the predictors that are not significant.
Row
-----------------------------------------------------------------------
### Predict Car Price Final Version
For this analysis we will use a pruned Linear Regression Model. We removed all variables with a p-value greater than the alpha of 0.05.
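The exclusion list in the chunk below was written by hand; as a sketch, the same one-pass pruning could be derived programmatically from the full model's p-values (hypothetical names `toy`, `full_lm`, and `pruned_lm`; toy data stands in for `df`):

```r
# Hypothetical one-pass pruning sketch: keep terms with p-value below alpha = 0.05
set.seed(42)
toy <- data.frame(x1 = rnorm(80), x2 = rnorm(80), x3 = rnorm(80))
toy$price <- 10000 + 3000 * toy$x1 + rnorm(80, sd = 500)
full_lm <- lm(price ~ ., data = toy)
# fourth coefficient column is Pr(>|t|); drop the intercept from the keep list
pvals <- coef(summary(full_lm))[, 4]
keep  <- setdiff(names(pvals)[pvals < 0.05], "(Intercept)")
pruned_lm <- lm(reformulate(keep, response = "price"), data = toy)
names(coef(pruned_lm))
```

Note that this prunes in a single pass; p-values shift after each removal, so iterative backward elimination could retain a slightly different set.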
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
price_lm <- lm(price ~ . -PriceDV -carlength -cb_sedan -wheelbase -highwaympg -citympg -cb_wagon -cb_hatchback -stroke -horsepower -carwidth -peakrpm -enginesize -compressionratio,data = df)
summary(price_lm)
```
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(price_lm)
```
### Adjusted R-Squared
```{r, cache=TRUE}
ARSq<-round(summary(price_lm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-up")
```
### RMSE
```{r, cache=TRUE}
Sig<-round(summary(price_lm)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-up")
```
Row
-----------------------------------------------------------------------
### Regression Output
```{r, include=FALSE, cache=TRUE}
knitr::kable(summary(price_lm)$coef, digits = 3) #pretty table output
```
```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(price_lm))[,4])
out <- coef(summary(price_lm))[idx,]
knitr::kable(out, digits = 3) #pretty table output
```
### Residual Assumptions Explorations
```{r, cache=TRUE}
plot(price_lm, which=c(1,2)) #which tells which plots to show (1-6 different plots)
```
Row
-----------------------------------------------------------------------
### Analysis Summary
After examining this model and looking at the residual plots, we can see some particularities in our data. The high values at the right of the Q-Q plot show that some observations have very extreme values. We can also see a small curve at the start of the Residuals vs. Fitted plot, meaning the model predicts better for cars with smaller prices (where the data is more concentrated). We could therefore transform the data or try some additional models to improve the fit.
This is somewhat expected: most cars fall within a certain price range, but luxury brands become outliers. In this case, all the brands used are considered roughly "equal", except for Volvo, which is considered more luxurious. This also explains why all the brand indicators that turned out to be significant (all except Volvo) decrease the average price, while Volvo, being more luxurious, is associated with a significantly higher price.
Reducing the predictors did not improve prediction of car price, but it also did not have a big impact on our fit statistics (adjusted R-squared and RMSE, root mean squared error): it reduced adjusted R-squared by about 1% and slightly increased RMSE.
From the following table, we can see the direction of each final predictor's effect on car price.
```{r, cache=TRUE}
#create table summary of predictor changes
predchang <- tibble(
  Variable = c('curbweight', 'CarSubaru', 'CarToyota', 'carheight', 'CarNissan', 'CarHonda', 'aspirationDV', 'boreratio'),
  Direction = c('Increase', 'Decrease', 'Decrease', 'Decrease', 'Decrease', 'Decrease', 'Decrease', 'Decrease')
)
knitr::kable(predchang) #pretty table output
```
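The adjusted R-squared and RMSE comparison quoted above can also be computed side by side; a minimal sketch with toy stand-in models (hypothetical names `fit_stats`, `full_lm`, `pruned_lm`; in the dashboard these would be the full and pruned price fits):

```r
# Sketch: side-by-side fit statistics for two nested models
fit_stats <- function(m) c(adj_r2 = summary(m)$adj.r.squared,
                           rmse   = summary(m)$sigma)
set.seed(1)
toy <- data.frame(x1 = rnorm(80), x2 = rnorm(80))
toy$price <- 10000 + 3000 * toy$x1 + rnorm(80, sd = 500)
full_lm   <- lm(price ~ x1 + x2, data = toy)
pruned_lm <- lm(price ~ x1,      data = toy)
round(rbind(full = fit_stats(full_lm), pruned = fit_stats(pruned_lm)), 3)
```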
PriceDV Analysis {data-orientation=rows}
=======================================================================
Row {data-height=900}
-----------------------------------------------------------------------
### Predict PriceDV
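The PriceDV classification model itself is not included in this source. As a minimal sketch only (not the original model), a logistic regression of the above/below-average flag on two of the kept predictors might look like this (toy data stands in for `df`):

```r
# Hypothetical sketch only -- the original PriceDV model is not shown here.
set.seed(7)
toy <- data.frame(curbweight = rnorm(60, 2400, 300),
                  carheight  = rnorm(60, 54, 2))
toy$PriceDV <- factor(as.integer(toy$curbweight + rnorm(60, sd = 200) > 2400))
# logistic regression of the above/below-average flag on two predictors
pricedv_glm <- glm(PriceDV ~ curbweight + carheight, family = binomial, data = toy)
# predicted vs. actual class at a 0.5 cutoff
table(predicted = predict(pricedv_glm, type = "response") > 0.5,
      actual = toy$PriceDV)
```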




Conclusion
=======================================================================
### Summary
In conclusion, we can see that our predictors do help to predict car price.
Combining the results of both types of predictor models and only reporting where agreement was found, we can see the direction in which car price moves as each of these variables increases:
```{r}
#final table summary of predictor changes
#the single Increase_car_price value is recycled by data.frame to fill all 7 rows
predchangfnl <- data.frame(
  Decrease_Car_Price = c("CarSubaru", "CarToyota", "carheight", "CarNissan",
                         "CarHonda", "aspirationDV", "boreratio"),
  Increase_car_price = c("curbweight"))
knitr::kable(predchangfnl) #pretty table output
```